import pandas as pdimport numpy as npimport jsonfrom lets_plot import*import requestsLetsPlot.setup_html(isolated_frame=True)# Load JSON dataurl ='https://github.com/byuidatascience/data4missing/raw/master/data-raw/flights_missing/flights_missing.json'response = requests.get(url)raw_json = response.json()# Normalize to DataFramedf = pd.json_normalize(raw_json)
Elevator pitch
A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
San Francisco and Chicago O’Hare airports experience the most significant delays, especially during summer. September is the best month to fly for fewer delays. Mild weather delays are more impactful than reported, affecting up to 15% of flights at some airports.
QUESTION|TASK 1
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__
All inconsistent missing values were standardized to NaN for analysis. Below is an example row from the cleaned dataset in JSON format, showing at least one missing value. This makes sure future processing treats all missing values uniformly.
Show the code
# Task 1: Replace inconsistent missing valuesdf.replace(["", None, -999, "n/a"], np.nan, inplace=True)# Show example row as JSON (with NaN included)example_row = df.iloc[2].to_dict()json.dumps(example_row, indent=2, default=str)
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
San Francisco International and Chicago O’Hare had the highest average delays. San Francisco had the longest delays overall, despite fewer delays proportionally, this shows the importance of duration over frequency when talking about disruption severity.
Show the code
summary = df.groupby("airport_name").agg({"num_of_flights_total": "sum","num_of_delays_total": "sum","minutes_delayed_total": "sum"}).reset_index()summary["proportion_delayed"] = summary["num_of_delays_total"] / summary["num_of_flights_total"]summary["avg_delay_hours"] = summary["minutes_delayed_total"] / summary["num_of_flights_total"] /60# Sort by average delaysummary_sorted = summary.sort_values("avg_delay_hours", ascending=False).reset_index(drop=True)# Display top 5 worst airportssummary_sorted.head()
airport_name
num_of_flights_total
num_of_delays_total
minutes_delayed_total
proportion_delayed
avg_delay_hours
0
San Francisco, CA: San Francisco International
1565257
408631
25488636
0.261063
0.271400
1
Chicago, IL: Chicago O'Hare International
3400032
773122
52165135
0.227387
0.255709
2
Atlanta, GA: Hartsfield-Jackson Atlanta Intern...
4235114
870910
52114971
0.205640
0.205091
3
Washington, DC: Washington Dulles International
773480
152630
9322510
0.197329
0.200878
4
Denver, CO: Denver International
2323376
439964
23660463
0.189364
0.169728
QUESTION|TASK 3
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
September had the lowest average delays, while July and December had the worst. Flying in early fall is best for minimizing disruption.
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
a. 30% of all delayed flights in the Late-Arriving category are due to weather
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
Missing values in the Late Aircraft field were filled with the column mean. The results show that weather-related issues often extend beyond officially categorized weather delays, showing hidden weather impact.
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
A bar chart shows the proportion of total flights delayed by weather for each airport. Airports like Atlanta and Denver had high weather-related delay rates. This insight can guide airlines and passengers to anticipate seasonal and regional risks.
Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.
Carrier delays are the most frequent, suggesting airline-specific issues are the leading cause of disruptions.
Weather is the second most common.
Security delays don’t really happen.
Show the code
categories = ["minutes_delayed_weather", "minutes_delayed_carrier", "minutes_delayed_security"]delay_summary = df[categories + ["num_of_flights_total"]].copy()for col in categories: delay_summary[col] = delay_summary[col].fillna(0) delay_summary[col] = delay_summary[col] / delay_summary["num_of_flights_total"]avg_delays = delay_summary[categories].mean().reset_index()avg_delays.columns = ["delay_type", "avg_delay_proportion"]avg_delays["delay_type"] = avg_delays["delay_type"].str.replace("minutes_delayed_", "").str.capitalize()ggplot(avg_delays, aes(x="delay_type", y="avg_delay_proportion")) +\ geom_bar(stat="identity", fill="#FF7F50") +\ labs(title="Average Proportion of Flights Delayed by Type", x="Delay Type", y="Proportion of Flights Delayed")